Mini-Project 02: Business of Show Business

In this paper, we will analyze the IMDB data to answer some questions, design a measurement framework for evaluating performance and identify an opportunity for investment. #### Data Prep
In this section, we obtain and prepare data for analysis. Because of the memory and performance issues, we will use small files and further down-select data to enable a more fluid analysis. We will drop titles with fewer than 100 ratings and individuals who worked on only 1 title.
Installing and Loading Libraries

Show the code
# Installing and loading libraries

if (!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)
if (!require("DT")) install.packages("DT")
library(DT)
if (!require("dplyr")) install.packages("dplyr")
library(dplyr)
if (!require("psych")) install.packages("psych")
library(psych)
if (!require("gt")) install.packages("gt")
library(gt)
if (!require("formattable")) install.packages("formattable")
library(formattable)
if (!require("sqldf")) install.packages("sqldf")
library(sqldf)
if (!require("plotly")) install.packages("plotly")
library(plotly)
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
if (!require("RColorBrewer")) install.packages("RColorBrewer")
library(RColorBrewer)

Reading data in

Show the code
## read files in and create dataframes

name_basics <- read.csv("name_basics_small.csv")
title_basics <- read.csv("title_basics_small.csv")
title_episodes <- read.csv("title_episodes_small.csv")
title_ratings <- read.csv("title_ratings_small.csv")
title_crew <- read.csv("title_crew_small.csv")
title_principals <- read.csv("title_principals_small.csv")

# drop records with fewer than 2 titles from name_basics df

name_basics <- name_basics |>
  filter(str_count(knownForTitles, ",") > 1)

# drop records with fewer than 100 ratings from title_ratings df

title_ratings <- title_ratings |>
  filter(numVotes >= 100)

Furthermore, to ensure consistency across all data sets, we will apply the same filtering, i.e., excluding titles with fewer than 100 ratings, to the rest of the title tables:

Show the code
# filtering title basics df

title_basics <- title_basics |>
  semi_join(
    title_ratings,
    join_by(tconst == tconst)
  )

# filtering title crew df

title_crew <- title_crew |>
  semi_join(
    title_ratings,
    join_by(tconst == tconst)
  )

# filtering title episodes df on title id

title_episodes_1 <- title_episodes |>
  semi_join(
    title_ratings,
    join_by(tconst == tconst)
  )

# filtering title episodes df on parent title id

title_episodes_2 <- title_episodes |>
  semi_join(
    title_ratings,
    join_by(parentTconst == tconst)
  )

# combining filtered title episodes dfs

title_episodes <- bind_rows(
  title_episodes_1,
  title_episodes_2
) |>
  distinct()

## filtering title principals df

title_principals <- title_principals |>
  semi_join(
    title_ratings,
    join_by(tconst == tconst)
  )

# remove dfs we no longer need

rm(title_episodes_1)
rm(title_episodes_2)

Task 1

Correct the column types of the title tables using a combination of mutate and the coercion functions as.numeric and as.logical.

Title Basics

glimpse(title_basics)
Rows: 372,198
Columns: 9
$ tconst         <chr> "tt0000001", "tt0000002", "tt0000003", "tt0000004", "tt…
$ titleType      <chr> "short", "short", "short", "short", "short", "short", "…
$ primaryTitle   <chr> "Carmencita", "Le clown et ses chiens", "Pauvre Pierrot…
$ originalTitle  <chr> "Carmencita", "Le clown et ses chiens", "Pauvre Pierrot…
$ isAdult        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ startYear      <chr> "1894", "1892", "1892", "1892", "1893", "1894", "1894",…
$ endYear        <chr> "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N",…
$ runtimeMinutes <chr> "1", "5", "5", "12", "1", "1", "1", "1", "45", "1", "1"…
$ genres         <chr> "Documentary,Short", "Animation,Short", "Animation,Come…

Columns startYear,endYear and runtimeMinutes are formatted as character/string in the original data set and need to be changed to be numeric.

## recode column types and rename columns

title_basics <- title_basics |>
  mutate(
    startYear = as.numeric(startYear),
    endYear = as.numeric(endYear),
    runtimeMinutes = as.numeric(runtimeMinutes)
  ) |>
  rename(
    start_year = startYear,
    end_year = endYear,
    runtime_minutes = runtimeMinutes
  )

glimpse(title_basics)
Rows: 372,198
Columns: 9
$ tconst          <chr> "tt0000001", "tt0000002", "tt0000003", "tt0000004", "t…
$ titleType       <chr> "short", "short", "short", "short", "short", "short", …
$ primaryTitle    <chr> "Carmencita", "Le clown et ses chiens", "Pauvre Pierro…
$ originalTitle   <chr> "Carmencita", "Le clown et ses chiens", "Pauvre Pierro…
$ isAdult         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ start_year      <dbl> 1894, 1892, 1892, 1892, 1893, 1894, 1894, 1894, 1894, …
$ end_year        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ runtime_minutes <dbl> 1, 5, 5, 12, 1, 1, 1, 1, 45, 1, 1, 1, 1, 1, 2, 1, 1, 1…
$ genres          <chr> "Documentary,Short", "Animation,Short", "Animation,Com…

Title Crew

glimpse(title_crew)
Rows: 371,902
Columns: 3
$ tconst    <chr> "tt0000001", "tt0000002", "tt0000003", "tt0000004", "tt00000…
$ directors <chr> "nm0005690", "nm0721526", "nm0721526", "nm0721526", "nm00056…
$ writers   <chr> "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "nm0…

There is no need to correct any data types here.

Title Episodes

glimpse(title_episodes)
Rows: 3,007,178
Columns: 4
$ tconst        <chr> "tt0045960", "tt0046855", "tt0048378", "tt0048562", "tt0…
$ parentTconst  <chr> "tt0044284", "tt0046643", "tt0047702", "tt0047768", "tt0…
$ seasonNumber  <chr> "2", "1", "1", "1", "1", "1", "1", "1", "1", "3", "3", "…
$ episodeNumber <chr> "3", "4", "6", "10", "4", "20", "5", "2", "20", "6", "2"…

seasonNumber and episodeNumber columns need to be converted to numeric format.

## recode column types and rename columns

title_episodes <- title_episodes |>
  mutate(
    seasonNumber = as.numeric(seasonNumber),
    episodeNumber = as.numeric(episodeNumber)
  ) |>
  rename(
    season_number = seasonNumber,
    episode_number = episodeNumber
  )

glimpse(title_episodes)
Rows: 3,007,178
Columns: 4
$ tconst         <chr> "tt0045960", "tt0046855", "tt0048378", "tt0048562", "tt…
$ parentTconst   <chr> "tt0044284", "tt0046643", "tt0047702", "tt0047768", "tt…
$ season_number  <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 1, 8, 1, 10, 6, 2, 8, …
$ episode_number <dbl> 3, 4, 6, 10, 4, 20, 5, 2, 20, 6, 2, 3, 2, 10, 17, 5, 1,…

Title Principals

glimpse(title_principals)
Rows: 6,586,689
Columns: 6
$ tconst     <chr> "tt0000001", "tt0000001", "tt0000001", "tt0000001", "tt0000…
$ ordering   <int> 1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 1, 2, 3, 4,…
$ nconst     <chr> "nm1588970", "nm0005690", "nm0005690", "nm0374658", "nm0721…
$ category   <chr> "self", "director", "producer", "cinematographer", "directo…
$ job        <chr> "\\N", "\\N", "producer", "director of photography", "\\N",…
$ characters <chr> "[\"Self\"]", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\…

There is no need to correct data types here.

Title Ratings

glimpse(title_ratings)
Rows: 372,198
Columns: 3
$ tconst        <chr> "tt0000001", "tt0000002", "tt0000003", "tt0000004", "tt0…
$ averageRating <dbl> 5.7, 5.6, 6.5, 5.4, 6.2, 5.0, 5.4, 5.4, 5.4, 6.8, 5.2, 7…
$ numVotes      <int> 2090, 283, 2094, 184, 2828, 196, 889, 2233, 214, 7699, 3…

There is no need to correct data types here.

Name Basics

glimpse(name_basics)
Rows: 2,460,608
Columns: 6
$ nconst            <chr> "nm0000001", "nm0000002", "nm0000003", "nm0000004", …
$ primaryName       <chr> "Fred Astaire", "Lauren Bacall", "Brigitte Bardot", …
$ birthYear         <chr> "1899", "1924", "1934", "1949", "1918", "1915", "189…
$ deathYear         <chr> "1987", "2014", "\\N", "1982", "2007", "1982", "1957…
$ primaryProfession <chr> "actor,miscellaneous,producer", "actress,soundtrack,…
$ knownForTitles    <chr> "tt0072308,tt0050419,tt0053137,tt0027125", "tt003738…

birthYear and deathYear columns need to be formatted as numeric.

## recode column types and rename columns

name_basics <- name_basics |>
  mutate(
    birthYear = as.numeric(birthYear),
    deathYear = as.numeric(deathYear)
  ) |>
  rename(
    birth_year = birthYear,
    death_year = deathYear
  )

glimpse(name_basics)
Rows: 2,460,608
Columns: 6
$ nconst            <chr> "nm0000001", "nm0000002", "nm0000003", "nm0000004", …
$ primaryName       <chr> "Fred Astaire", "Lauren Bacall", "Brigitte Bardot", …
$ birth_year        <dbl> 1899, 1924, 1934, 1949, 1918, 1915, 1899, 1924, 1925…
$ death_year        <dbl> 1987, 2014, NA, 1982, 2007, 1982, 1957, 2004, 1984, …
$ primaryProfession <chr> "actor,miscellaneous,producer", "actress,soundtrack,…
$ knownForTitles    <chr> "tt0072308,tt0050419,tt0053137,tt0027125", "tt003738…

Task 2 - Instructor-Provided Questions

Q1. How many movies are in our data set? How many TV series? How many TV episodes?

To answer this question, we will use the title basics data set, which contains release and production information.

# get a count of records by content types

df1 <- title_basics |>
  group_by(titleType) |>
  summarize(number_of_records = n()) |>
  ungroup() |>
  mutate(number_of_records = comma(number_of_records, digits = 0)) |>
  rename(title_type = titleType) |>
  arrange(desc(number_of_records))

# plot the resulting df

fig_content_count_type <- plot_ly(
  data = df1,
  y = ~ reorder(title_type, number_of_records),
  x = ~number_of_records,
  type = "bar",
  orientation = "h",
  marker = list(color = "cerulean"),
  width = 500,
  height = 300
)

fig_content_count_type <- fig_content_count_type |>
  layout(
    title = "Number of Titles by Content Type",
    xaxis = list(title = "Number of Records"),
    yaxis = list(title = "")
  )


fig_content_count_type

Content type is captured in the titleType column. We have 131,662 movies, 29,789 TV Series and 155,722 TV episodes.

Q2. Who is the oldest living person in our data set?

To answer this question, we will use the name basics table, which has birth and death records. However, a quick examination of data highlights certain irregularities in death records. It appears that we are missing actual death records for a number of individuals who, despite being born prior to the 20th century, are appear to still be alive.

# list living persons by year of birth

# Subset of data - 10 oldest presumably living persons 
name_basics |>
  filter(is.na(death_year) & !is.na(birth_year)) |>
  arrange(birth_year) |>
  head(10) |>
  gt()
nconst primaryName birth_year death_year primaryProfession knownForTitles
nm5671597 Robert De Visée 1655 NA composer,soundtrack tt2219674,tt1743724,tt0441074,tt14426058
nm7807390 William Sandys 1767 NA composer,soundtrack tt4396584,tt3747572,tt4555594,tt0071007
nm1441282 Richard Dybeck 1811 NA soundtrack tt0021783,tt0022126,tt0036372,tt0037562
nm6711738 Albert Monnier 1815 NA writer tt0329972,tt3966780,tt6793558,tt15175930
nm1227803 C. Hostrup 1818 NA writer,composer,actor tt0031361,tt0134089,tt0844680,tt14463014
nm1329526 Edouard Martin 1825 NA writer tt0200268,tt0329972,tt3966780,tt0036496
nm1197286 Ion Ivanovici 1845 NA composer,soundtrack tt0043412,tt0040391,tt1324061,tt0083697
nm0179107 Attilio Corbell 1850 NA actor tt0009508,tt0009121,tt0182770,tt0007472
nm0843185 André Sylvane 1850 NA writer tt0019480,tt0155273,tt0159028,tt0167460
nm0242243 Charles Dungan 1853 NA actor tt0267008,tt0008259,tt0008876,tt0003634
# create a df with records of living persons

df3 <- name_basics |>
  filter(is.na(death_year) & !is.na(birth_year)) |>
  group_by(birth_year) |>
  summarise(number_of_records = n()) |>
  ungroup() |>
  arrange(birth_year)

# plot the resulting df

fig_cnt_living_persons <- plot_ly(
  data = df3,
  x = ~birth_year,
  y = ~number_of_records,
  type = "bar",
  marker = list(color = "cerulean"),
  width = 500,
  height = 300
)

fig_cnt_living_persons <- fig_cnt_living_persons |>
  layout(
    title = "Living Persons by Year of Birth",
    xaxis = list(title = "Year of Birth"),
    yaxis = list(title = "Count of Living Persons")
  )

fig_cnt_living_persons

Since we can’t manually verify verify hundreds of questionable records, we will have to use a rule-based approach to answer this question. The oldest verified person to have ever lived was 122 years and 164 days at the time of death so using this age as a threshold, we can filter out all individuals born after 1902, which leaves us with 65 individuals born in 1903.

# list count of living persons by year of birth

name_basics |>
  filter(is.na(death_year) & !is.na(birth_year) & birth_year > 1902) |>
  group_by(birth_year) |>
  summarize(number_of_records = n()) |>
  ungroup() |>
  arrange(birth_year) |>
  head(5) |>
  gt()
birth_year number_of_records
1903 65
1904 77
1905 68
1906 83
1907 78
# list living persons born in 1903
df4 <- name_basics |>
  filter(birth_year == 1903 & is.na(death_year)) |>
  select(primaryName, birth_year, death_year) |>
  arrange(primaryName)

sample_n(df4, 65) |>
  DT::datatable()
Q3. There is one TV Episode in this data set with a perfect 10/10 rating and 200,000 IMDb ratings. What is it? What series does it belong to?

To answer this question, we need to use 3 data sets, title ratings,title basics and title episodes.

# create df with list of all TV episodes

list_tv_epis <- title_basics |>
  filter(titleType == "tvEpisode") |>
  select(tconst, titleType, primaryTitle)

# create df with list of all TV series

list_tv_series <- title_basics |>
  filter(titleType == "tvSeries") |>
  select(tconst, titleType, primaryTitle)

# create df with records of tv episodes

tv_ep_df1 <- inner_join(list_tv_epis, title_episodes, by = "tconst")

# join ratings data

tv_ep_df2 <- inner_join(tv_ep_df1, title_ratings, by = "tconst")

# find a TV episode meeting criteria

tv_ep_df3 <- tv_ep_df2 |>
  filter((numVotes >= 200000) & (averageRating == 10))

# map tv series name

tv_ep_ratings_df <- inner_join(tv_ep_df3, list_tv_series, by = c("parentTconst" = "tconst"))

# rename columns in the resulting df
tv_ep_ratings_df |>
  rename(
    episode_id = tconst,
    average_rating = averageRating,
    number_of_ratings = numVotes,
    title_type = titleType.x,
    episode_title = primaryTitle.x,
    series_id = parentTconst,
    series_name = primaryTitle.y,
    parent_title_type = titleType.y
  ) |>
  gt()
episode_id title_type episode_title series_id season_number episode_number average_rating number_of_ratings parent_title_type series_name
tt2301451 tvEpisode Ozymandias tt0903747 5 14 10 227589 tvSeries Breaking Bad

The TV episode with the perfect 10/10 rating and over 200K reviews is Ozymandias ep.15 season 5 of the cult TV hit Breaking Bad.

Q4. What four projects is the actor Mark Hamill most known for?

To answer this question, we will use name basics and title basics data sets.

# get title records for mark hamill

mh_df <- name_basics |>
  filter(primaryName == "Mark Hamill") |>
  select(primaryName, knownForTitles) |>
  separate_longer_delim(knownForTitles, ",")

# map titles names and types on the list of selected content IDs

mh_df2 <- inner_join(mh_df, title_basics, by = c("knownForTitles" = "tconst"))
mh_df2 |>
  select(knownForTitles, titleType, primaryTitle, start_year) |>
  rename(
    title_id = knownForTitles,
    content_type = titleType,
    content_title = primaryTitle,
    year = start_year
  ) |>
  gt() |>
  tab_header(
    title = "Titles Mark Hamill is Known For"
  )
Titles Mark Hamill is Known For
title_id content_type content_title year
tt0076759 movie Star Wars: Episode IV - A New Hope 1977
tt2527336 movie Star Wars: Episode VIII - The Last Jedi 2017
tt0080684 movie Star Wars: Episode V - The Empire Strikes Back 1980
tt0086190 movie Star Wars: Episode VI - Return of the Jedi 1983

Mark Hamill is known for his roles in the Star Wars movies, where he first starred in 1977 and most recently in 2017.

Q5. What TV series, with more than 12 episodes, has the highest average rating?

To answer this question, we need 3 data sets - title_episodes, title ratings and title basics.

# we already have a df with all TV series - we created it in a previous question - list_tv_series

# create a df with records of tv series wirh all episodes

ep_filtered_series <- inner_join(title_episodes, list_tv_series, by = c("parentTconst" = "tconst"))

# df with tv series with 12+ episodes

series_num_epis <- ep_filtered_series |>
  group_by(parentTconst, primaryTitle, titleType) |>
  summarise(num_episodes = n()) |>
  ungroup() |>
  arrange(desc(num_episodes)) |>
  filter(num_episodes >= 12)


datatable(series_num_epis)

We have over 20K TV series with 12 or more episodes.

# join tv episodes and series data with ratings data

ep_filtered_series_ratings <- inner_join(ep_filtered_series,
  title_ratings,
  by = "tconst"
)

# drop all tv series with fewer than 12 episodes

ep_filtered_series_ratings2 <- inner_join(ep_filtered_series_ratings,
  series_num_epis,
  by = "parentTconst"
)

# calculate average ratings for tv series

ep_filtered_series_ratings2 |>
  group_by(parentTconst, primaryTitle.x) |>
  summarise(avg_rating = mean(averageRating)) |>
  ungroup() |>
  rename(
    tv_series_id = parentTconst,
    tv_series_title = primaryTitle.x
  ) |>
  arrange(desc(avg_rating)) |>
  head(5) |>
  gt() |>
  tab_header(
    title = "Top 5 TV Series by Average Rating",
    subtitle = "TV series with 12 or more episodes only"
  )
Top 5 TV Series by Average Rating
TV series with 12 or more episodes only
tv_series_id tv_series_title avg_rating
tt0409579 Made 10.0
tt11363282 The Real Housewives of Salt Lake City 10.0
tt21278628 Cowboys of Thunder 10.0
tt0060008 The Milton Berle Show 9.9
tt0168358 Parkinson 9.9

There are 3 TV series that obtained the perfect 10/10 rating - ‘Made’,‘The Real Housewives of Salt Lake City’ and ‘Cowboys of Thunder’.

Q6. Is it true that episodes from later seasons of Happy Days have lower average ratings than the early seasons?

To answer this question, we will use title basics,title episodes and title ratings data sets:

# create df for TV series 'Happy Days'

hd_df1 <- title_basics |>
  filter(primaryTitle == "Happy Days" & titleType == "tvSeries")

# join HD df with detailed TV episodes data
hd_detail <- inner_join(title_episodes, hd_df1, by = c("parentTconst" = "tconst"))

# join ratings data to detailed Happy Days records

hd_detail_ratings <- inner_join(hd_detail, title_ratings, by = "tconst")

datatable(hd_detail_ratings)

Now that we have detailed records on all episodes of the Happy Days TV series, we can calculate the average rating for each season.

# create df with average rating by season

avg_hd_detail_ratings <- hd_detail_ratings |>
  group_by(season_number) |>
  summarise(avg_rating_season = mean(averageRating)) |>
  ungroup() |>
  arrange(season_number)

# plot the resulting df

fig_hd_seasons <- plot_ly(
  data = avg_hd_detail_ratings,
  x = ~season_number,
  y = ~avg_rating_season,
  type = "bar",
  marker = list(color = "cerulean"),
  width = 500,
  height = 300
)

fig_hd_seasons <- fig_hd_seasons |>
  layout(title = "Happy Days - Average Rating by Season",
         xaxis = list(title = "Season #"),
         yaxis = list(title = "Average Rating")
    )

fig_hd_seasons

It appears that the earlier seasons of the series indeed had higher average ratings compared to the more recent seasons.

Task 3

Design a ‘success’ measure for IMDb entries, reflecting both quality and broad popular awareness.

As we found in Q1 in Task1, movies constitute the absolute majority of records in our data - 131.6K records vs 29.8K for TV series, the next largest category of content. We do not include TV episode in this analysis as TV episodes are not a standalone content.

# plot number of records by content type from the earlier question

fig_content_count_type

Given the obvious differences in production, marketing, and audience appeal, we will focus on movies for this part of the exercise.
Let’s start with creating a data frame with ratings data for movies.

# create df with list of all movies

list_movies <- title_basics |>
  filter(titleType == "movie") |>
  select(tconst, titleType, primaryTitle, start_year, genres, runtime_minutes, isAdult)

# join with ratings data

movie_ratings_df <- inner_join(list_movies, title_ratings, by = "tconst")

movie_ratings_df2 <- movie_ratings_df |>
  rename(
    title = primaryTitle,
    title_id = tconst,
    content_type = titleType,
    year = start_year,
    average_rating = averageRating,
    number_of_votes = numVotes
  )

# sample movie df

sample_n(movie_ratings_df2, 1000) |>
  DT::datatable()

Next we will conduct an explanatory data analysis on our movies data set to better understand the two ratings metrics.

# subset metrics

movie_ratings_df2_metrics <- movie_ratings_df2 |>
  select(average_rating, number_of_votes)

# describe metrics

summary(movie_ratings_df2_metrics)
 average_rating   number_of_votes  
 Min.   : 1.000   Min.   :    100  
 1st Qu.: 5.200   1st Qu.:    195  
 Median : 6.100   Median :    459  
 Mean   : 5.923   Mean   :   8694  
 3rd Qu.: 6.800   3rd Qu.:   1664  
 Max.   :10.000   Max.   :2942823  
# histogram of average ratings

# plot a histogram of number of ratings in plotly
avg_ratings_x <- movie_ratings_df2$average_rating

fig_hist_avg_ratings <- plot_ly(
  x = avg_ratings_x,
  type = "histogram",
  nbinsx = 100,
  marker = list(color = "cerulean")
) |>
  layout(
    title = "Distribution of Average Movie Ratings",
    xaxis = list(title = "Average Rating"),
    yaxis = list(title = "Frequency")
  )

fig_hist_avg_ratings
# histogram of average ratings

# plot a histogram of number of ratings in plotly

num_ratings_x <- movie_ratings_df2$number_of_votes

fig_distr_number_ratings <- plot_ly(
  x = num_ratings_x,
  type = "histogram",
  nbinsx = 80,
  marker = list(color = "cerulean")
) |>
  layout(
    title = "Distribution of Movie Ratings",
    xaxis = list(title = "Number of Ratings"),
    yaxis = list(type = "log", title = "Frequency (Log-Scaled)")
  )

fig_distr_number_ratings

Looking at descriptive statistics and statistical plots, we can see that most titles have relatively high average ratings. 50% of all titles have a rating above 6.1, and top 25% of titles have a rating over 6.8. Distribution of number of ratings, on the other hand, has a right skew, meaning that we have only a handful of titles with a very high number of votes.
Since we need to design a blended performance metric, we need to account for quality and popularity of a title simultaneously which can be done by an averaging of these two metrics. Before we proceed, we need to standardize the data to account for differences in magnitude and distribution of ratings and votes variables:

# calculate mean and standard deviation for ratings and votes data

movie_ratings_df3 <- movie_ratings_df2 |>
  mutate(
    avg_ratings_movies = mean(average_rating),
    avg_number_ratings = mean(number_of_votes),
    sd_avg_ratings = sd(average_rating),
    sd_number_ratings = sd(number_of_votes)
  )

sample_n(movie_ratings_df3, 1000) |>
  DT::datatable(options = list(
    pageLength = 5
  ))

Now we can create standardized metrics for ratings and votes, as well as the blended performance index reflecting the quality of the movie (via average rating) and the popularity of the movie (via number of ratings), with equal weight given to each input.

# create standardized metrics for votes and ratings

movie_ratings_df4 <- movie_ratings_df3 |>
  mutate(
    score_rating = round((average_rating - avg_ratings_movies) / sd_avg_ratings, 2),
    score_votes = round((number_of_votes - avg_number_ratings) / sd_number_ratings, 2),
    performance_index = round((score_rating + score_votes) / 2, 2)
  )

sample_n(movie_ratings_df4, 1000) |>
  DT::datatable(options = list(
    pageLength = 5
  ))
# descriptive statistics for performance index

movie_ratings_df4_pi <- movie_ratings_df4 |>
  select(performance_index)

summary(movie_ratings_df4_pi)
 performance_index  
 Min.   :-1.990000  
 1st Qu.:-0.360000  
 Median : 0.030000  
 Mean   : 0.000264  
 3rd Qu.: 0.310000  
 Max.   :27.490000  
# histogram of performance index

pi_x2 <- movie_ratings_df4$performance_index

fig7 <- plot_ly(
  x = pi_x2,
  type = "histogram",
  nbinsx = 200,
  marker = list(color = "blue")
) |>
  layout(
    title = "Distribution of Movie Performance Indices",
    xaxis = list(title = "Performance Index"),
    yaxis = list(title = "Frequency")
  )

fig7
# % of titles with negative PI

movie_ratings_df4_pi |>
  summarise(
    titles_with_negative_pi = sum(performance_index < 0),
    all_titles = n()
  ) |>
  mutate(share_of_titles_with_negative_pi = round(titles_with_negative_pi / all_titles, 2)) |>
  gt()
titles_with_negative_pi all_titles share_of_titles_with_negative_pi
61673 131662 0.47

Performance index penalizes titles with subpar, i.e., below average, popularity and/or quality. 47% of movies in our data set have negative performance index.

Performance Index Validation

1.Choose the top 5-10 movies on your metric and confirm that they were indeed box office successes.

# top 5 movies

mrdf <- movie_ratings_df4 |>
  select(title, year, genres, average_rating, number_of_votes, performance_index)

mrdf |>
  arrange(performance_index) |>
  slice_max(performance_index, n = 5) |>
  gt() |>
  tab_header(
    title = "Top 5 Movies by Peformance Index"
  )
Top 5 Movies by Peformance Index
title year genres average_rating number_of_votes performance_index
The Shawshank Redemption 1994 Drama 9.3 2942823 27.49
The Dark Knight 2008 Action,Crime,Drama 9.0 2922922 27.20
Inception 2010 Action,Adventure,Sci-Fi 8.8 2595555 24.20
Fight Club 1999 Drama 8.8 2374722 22.23
Forrest Gump 1994 Drama,Romance 8.8 2301630 21.57

Among top 5 movies based on performance index, four (with the exception of The Shawshank Redemption) were commercial successes, and The Shawshank Redemption is still widely considered to be one of the beloved and most critically acclaimed movies of all times.

  1. Choose 3-5 movies with large numbers of IMDb votes that score poorly on your success metric and confirm that they are indeed of low quality.
# add this to top line to change plot size: , fig.width=4,fig.height=4}


# plot ratings and votes data

gfig <- ggplot(data = movie_ratings_df2, aes(x = average_rating, y = number_of_votes)) +
  geom_point(size = 1, color = "blue") +
  labs(
    title = "Movie Quality (Average Rating) and Popularity (Number of Ratings)",
    x = "Average Rating",
    y = "Number of Ratings"
  ) +
  theme_minimal() +
  theme_bw() +
  scale_x_log10(label = scales::comma) +
  scale_y_log10(label = scales::comma) 


gfig

As seen on this chart, we should have a decent number of movies with average rating of 1-2 and 80K-100K number of ratings, so we will look up titles meeting these criteria:

movie_ratings_df4 |>
  filter(average_rating < 3 & number_of_votes >= 75000) |>
  arrange(desc(performance_index)) |>
  select(title, year, genres, average_rating, number_of_votes, performance_index) |>
  gt()
title year genres average_rating number_of_votes performance_index
Radhe 2021 Action,Crime,Thriller 1.9 180205 -0.04
Adipurush 2023 Action,Adventure,Drama 2.7 133981 -0.13
Meet the Spartans 2008 Comedy,Fantasy 2.8 112199 -0.29
Epic Movie 2007 Adventure,Comedy,Fantasy 2.4 110222 -0.47
Battlefield Earth 2000 Action,Adventure,Sci-Fi 2.5 83786 -0.66
Dragonball Evolution 2009 Action,Adventure,Fantasy 2.5 80118 -0.70
Disaster Movie 2008 Comedy,Sci-Fi 1.9 95170 -0.80
Justin Bieber: Never Say Never 2011 Documentary,Music 1.7 76466 -1.04
Sadak 2 2020 Action,Drama 1.2 96825 -1.06

Indeed, these movies score very poorly on the performance index, and while they have a relatively large volume of ratings, they also have low average ratings.

  1. Choose a prestige actor or director and confirm that they have many projects with high scores on your success metric.

Steven Spielberg, one of the most famous and successful directors of our time, has 4 very successful projects with performance index of ranging from 4.65 to 14.54, which puts these titles in top 1% of all movies in our data set.

# get title records for Steven Spielberg

bp_df <- name_basics |>
  filter(primaryName == "Steven Spielberg") |>
  select(primaryName, knownForTitles) |>
  separate_longer_delim(knownForTitles, ",")

# map titles names and types on the list of selected content IDs

bp_df2 <- inner_join(bp_df, title_basics, by = c("knownForTitles" = "tconst"))
bp_df3 <- bp_df2 |>
  select(primaryName, knownForTitles, titleType, primaryTitle) |>
  rename(
    name = primaryName,
    title_id = knownForTitles,
    content_type = titleType,
    content_title = primaryTitle
  )

#select performance index and title
movie_pi_df <- movie_ratings_df4 |>
  select(title_id, average_rating, number_of_votes, performance_index)

# join to SS records
bp_df4 <- inner_join(bp_df3, movie_pi_df, by = "title_id")
datatable(bp_df4)
# percentiles for performance index

quantile(movie_ratings_df4$performance_index, probs = c(0,0.125,0.375,0.625,0.875,0.9,0.95,0.99,1))
   0% 12.5% 37.5% 62.5% 87.5%   90%   95%   99%  100% 
-1.99 -0.67 -0.16  0.16  0.51  0.58  0.78  2.00 27.49 
  1. Perform at least one other form of ‘spot check’ validation.

Avatar, the highest-grossing movie of all times ($2.9B worldwide gross) has a performance index of 13.2, which puts it in top 1% of our data set.

#select performance index and title

movie_ratings_df4 |>
  select(title,genres,year, average_rating, number_of_votes, performance_index) |>
  filter((title=="Avatar") & (year==2009)) |>
  gt()
title genres year average_rating number_of_votes performance_index
Avatar Action,Adventure,Fantasy 2009 7.9 1402915 13.2
  1. Come up with a numerical threshold for a project to be a ‘success’; that is, determine a value such that movies above are all “solid” or better.
# percentiles for performance index

quantile(movie_ratings_df4$performance_index, probs = c(0,0.10,0.20,0.40,0.50,0.60,0.80,0.95,1))
   0%   10%   20%   40%   50%   60%   80%   95%  100% 
-1.99 -0.74 -0.47 -0.12  0.03  0.12  0.38  0.78 27.49 

We will use 0.38 (top 20% score cutoff) as a threshold of success - titles with performance index of 0.38 or higher are high performers.

Task 5: Key Personnel

Identify (at least) two actors and one director who you will target as the key talent for your movie. Write a short “pitch” as to why they are likely to be successful. You should support your pitch with at least one graphic and one table.

Since we are going to be developing a documentary title, we need to adjust this question a bit and identify a director-writer-producer team as opposed to a director-actors team.

#get a list of titles in documentary genre, made after 1970 , with sufficient level of awareness and high performance index and map director and writer info

doc_df1<-sqldf(
"
    with a as(
    select title_id,
    title,
    decade,
    performance_index,
    average_rating,
    number_of_votes,
    success_flag
    from movie_ratings_df4
    where 1=1
    and genres='Documentary'
    and year>=1970
   and success_flag=1
and number_of_votes>=5000
    )
    select a.*,
    t.directors,
    t.writers,
    n.primaryName as director_name,
    n2.primaryName as writer_name,
    tb.start_year
    from a
    inner join title_crew t
    on a.title_id=t.tconst
    inner join name_basics n
    on directors=n.nconst
    inner join name_basics n2
    on t.writers=n2.nconst
    inner join title_basics tb
    on a.title_id=tb.tconst
    order by a.performance_index desc, a.number_of_votes desc, a.average_rating desc
    
    ;
  "
  )  
  
datatable(doc_df1)
doc_df2<-sqldf(
"
select director_name,
writer_name,
director_name||'-'||writer_name as movie_team,
count(title_id) as cnt_movies,
avg(performance_index) as avg_performance_index,
avg(average_rating) as avg_rating,
avg(number_of_votes) as avg_number_of_ratings
from doc_df1
group by 1,2,3
having count(title_id)>1
order by 4 desc
    ;
  "
  )

doc_df2
   director_name   writer_name                  movie_team cnt_movies
1  Werner Herzog Werner Herzog Werner Herzog-Werner Herzog          3
2  Michael Moore Michael Moore Michael Moore-Michael Moore          2
3 Sophie Fiennes  Slavoj Zizek Sophie Fiennes-Slavoj Zizek          2
  avg_performance_index avg_rating avg_number_of_ratings
1                  0.59       7.30              14358.33
2                  0.69       7.35              24127.50
3                  0.68       7.70               7500.00

Looking at the high-performing documentaries from 1970s - present, 3 film makers have produced multiple successful titles: Werner Herzog, Michael Moore and the director-writer duo of Sophie Fiennes and Slavoj Zizek. Since we need to identify a team for our next project, we propose to approach the Fiennes-Zizek duo as they have already demonstrated they can successfully work together, which might not be the case for established solo creators Moore and Herzog.

# plot movie team data


fig_movie_team<-plot_ly(data=doc_df2,
                        x = ~avg_rating,
                        y = ~avg_number_of_ratings,
                        type = 'scatter',
                        mode = 'markers',
                        color=~movie_team
                        #,
                        # width = 500,
                        #height = 500
                        ) |>
   layout(
    title = "Movie Team Performance Comparison",
    xaxis = list(title = "Quality (Average Rating)"),
    yaxis = list(
      title = "Popularity (Number of Ratings)"),
    legend = list(
    orientation = 'h',  # Horizontal legend 
    x = 0.5,            # Center horizontally
    xanchor = 'center', # Align center
    y = -0.2
    #,           # Position below the plot
   # font = list(size =8 )  # Smaller font size
    )
  )
 
fig_movie_team

Titles produced by Moore and Herzog appear to have a higher awareness among viewers but Fiennes-Zizek work is not far behind, and a more polarizing topic and a targeted marketing and PR campaign can help address this slight shortcoming.

Task 6: Finding a Classic Movie to Remake

Find a classic movie to remake with your key talent. The original should have a large number of IMDb ratings, a high average rating, and not have been remade in the past 25 years.

When looking at the top documentary titles, __Super Size Me_ is a definite outlier: >Super Size Me premiered at the 2004 Sundance Film Festival, where Morgan Spurlock won the Grand Jury Prize for directing the film.The film opened in the US on May 7, 2004, and grossed a total of $11,536,423 worldwide, making it the 7th highest-grossing documentary film of all time.It was nominated for an Academy Award for Best Documentary Feature and won the award for Best Documentary Screenplay from the Writers Guild of America.

(Source)

A 2017 title Super Size Me 2: Holy Chicken! from the same director also performed reasonably well, even in the light of certain issues with with publicity and distribution. It’s important to note that this film was not a remake of a original title as it was focused on the process of opening a fast-food restaurant. (Source)
Given the success of the 2004 ‘Super Size Me’ and increasing popularity of the semaglutide drugs, we should consider making a documentary about a weight loss journey and impact of taking this medicine on one’s life, health and mind - a ‘Super Size Me’ journey in reverse. While this movie was released 20 years ago, cultural context, relevancy and timeliness play a huge role in documentary titles success, and for this topic the time is definitely now. Another reason to pursue this opportunity now is an unhappy one as Morgan Spurlock, the writer and director of both ‘Super Size Me’ titles, died in May of this year so re imagining his most famous work could serve as a tribute to Spurlock’s many talents and the impact his vision and creative genius left on our society. As a possible contributor to our project, We can consider Lee Fulkerson, who wrote and directed an award-winning and highly acclaimed documentary Forks Over Knives as he has already successfully explored the topic of self-improvement in his 2011 movie (performance index of 0.73).

Task 7: Write and Deliver Your Pitch

From Sophie Fiennes and Slavoj Zizek, the masters of philosophical and psychoanalytical exploration, and Lee Fulkerson, the visionary mind behind an inspiring story of human transformation, inspired by a critically acclaimed hit Super Size Me, comes the modern take on a timeless tale of metamorphosis, obsession and desire to be perfect at any cost. XXS Me: The Beginning coming to Netflix in December 2025.